Abstract

We study the new problem of Huffman-like codes subject to individual restrictions on the code-word lengths
of a subset of the source words. These are prefix codes with minimal expected code-word length for a random
source where additionally the code-word lengths of a subset of the source words are prescribed, possibly differently
for every such source word. Based on a structural analysis of properties of optimal solutions, we construct an
efficient dynamic programming algorithm for this problem, and for an integer programming problem that may be
of independent interest.



I. Introduction 

We are given a random variable $X$ with outcomes in a set $\chi = \{x_1, \ldots, x_n\}$ of source words with associated
probabilities $P(X = x_i) = p_i$ with $p_1 \ge p_2 \ge \cdots \ge p_n > 0$, and a code word subset of $\Omega = \{0, 1\}^*$, the set of
finite binary strings. Let $l(y)$ denote the length (number of bits) of $y \in \Omega$. Let $I = \{i : x_i \in \chi\}$.

The optimal source coding problem is to find a 1:1 mapping $c : \chi \to \Omega$, such that $c(x_i)$ is not a proper prefix
of $c(x_j)$ for every pair $x_i, x_j \in \chi$, and such that $C_c(X) = \sum_{i \in I} p_i \, l(c(x_i))$ is minimal among all such mappings.
Codes satisfying the prefix condition are called prefix codes or instantaneous codes.

This problem is solved theoretically up to 1 bit by Shannon's Noiseless Coding Theorem [13], and exactly
and practically by a well-known greedy algorithm due to Huffman [10], which for $n$ source words runs in $O(n)$
steps, or $O(n \log n)$ steps if the $p_i$'s are not sorted in advance. If $c_0$ achieves the desired minimum, then denote

$$C(X) = C_{c_0}(X).$$

We study the far more general question of length restrictions on the individual code words, possibly different
for each code word. This problem has not been considered before. The primary problem in this setting is the
problem with equality length restrictions, where we want to find the minimal expected code-word length under
the restriction of individually prescribed code-word lengths for a subset of the code words. Apart from being
a natural question it is practically motivated by the desire to save some part of the code tree for future code
words, or to restrict the lengths of the code words for certain source words to particular values. For example, in
micro-processor design we may want to reserve code-word lengths for future extensions of the instruction set. No
polynomial time algorithm was known for this problem. Initially, we suspected it to be NP-hard. Here, we show
an $O(n^3)$ dynamic programming algorithm. This method allows us to solve an integer programming problem
that may be of independent interest. The key idea is that among the optimal solutions, some necessarily exhibit
structure that makes the problem tractable. This enables us to develop an algorithm that finds those solutions
among the many possible solutions that otherwise exhibit no such structure. Formally, we are given length
restrictions $\{l_i : i \in I\}$, where the $l_i$'s are positive integer values, or the dummy $\perp$, and we require that the
coding mapping $c$ satisfies $l(c(x_i)) = l_i$ for every $i \in I$ with $l_i \ne \perp$. For example, the length restrictions
$1, 2, \perp, \ldots, \perp$ mean that we have to set $l(c(x_1)) = 1$ and $l(c(x_2)) = 2$, say $c(x_1) = 1$ and $c(x_2) = 01$. Then,
for the remaining $x_i$'s the coding mapping $c$ can use only code words that start with 00. We assume that the
length restrictions satisfy (1) below, Kraft's inequality [8],

$$\sum_{i \in I} 2^{-l_i} \le 1, \qquad (1)$$

where we take $l_i = \infty$ for $l_i = \perp$, since otherwise there does not exist a prefix code as required.
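As an illustration of condition (1), the following minimal sketch checks the Kraft sum exactly with rational arithmetic. The function names and the use of `None` for the dummy restriction are assumptions made for this example, not notation from the paper.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact Kraft sum of the restricted lengths.

    `None` stands for the dummy restriction (length taken as infinity),
    so it contributes nothing to the sum.
    """
    return sum((Fraction(1, 2 ** l) for l in lengths if l is not None), Fraction(0))

def restrictions_feasible(lengths):
    """True when the restricted lengths satisfy inequality (1)."""
    return kraft_sum(lengths) <= 1

# The introductory example: lengths 1 and 2 prescribed, the rest free.
example = [1, 2, None, None]
print(kraft_sum(example), restrictions_feasible(example))
```

Note that when the sum equals exactly 1 the restricted code words already form a complete prefix code, so a code can exist only if no unrestricted source word remains.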

Related Work: In [9], [5], [12], [3], [7] a variant of this question is studied by bounding the maximal 
code-word length, which results in a certain redundancy (non-optimality) of the resulting codes. In [2] both the 
maximal code-word length and minimal code-word length are prescribed. 

II. Noiseless Coding under Equality Restrictions 

Shannon's Noiseless Coding Theorem [13] states that if $H(X) = \sum_{i \in I} p_i \log 1/p_i$ is the entropy of the
source, then $H(X) \le C(X) < H(X) + 1$. The standard proof exhibits the Shannon-Fano code achieving this
optimum by encoding $x_i$ by a code word $c(x_i)$ of length $l(c(x_i)) = \lceil \log 1/p_i \rceil$ ($i \in I$). Ignoring the upper
rounding to integer values for the moment, we see that $C(X) = H(X)$ for a code that codes $x_i$ by a code word
of length $\log 1/p_i$. This suggests the following approach.
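The bound can be checked numerically. The sketch below computes the entropy and the rounded-up Shannon-Fano lengths for a small distribution chosen only for illustration. The helper names are hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits of a probability vector with positive entries."""
    return sum(p * math.log2(1 / p) for p in probs)

def shannon_lengths(probs):
    """Code-word lengths ceil(log 1/p_i) used in the standard proof."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

# Illustrative distribution (an assumption, not taken from the paper).
probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(probs)
expected = sum(p * l for p, l in zip(probs, lengths))
kraft = sum(2.0 ** -l for l in lengths)

print(lengths, expected, entropy(probs), kraft)
```

The rounded lengths satisfy Kraft's inequality, so a prefix code with these lengths exists, and its expected length lies within one bit of the entropy.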

Suppose we are given length restrictions $\{l_i : i \in I\}$. Let $L = \{i \in I : l_i \ne \perp\}$ be the set of equality
length restrictions, and let $C(X, L)$ be the minimal expected code-word length under these restrictions given the
probabilities. Similar to Shannon's noiseless coding theorem, we aim to bound the minimal expected code-word
length under equality restrictions below by an entropy equivalent, $H(X, L) \le C(X, L) < H(X, L) + 1$, where
$H(X, L)$ corresponds to the best possible coding with real-valued code-word lengths. Define

$$q_i = 2^{-l_i} \text{ for } i \in L. \qquad (2)$$

If we define $q_i$'s also for the $x_i$'s with $i \in I - L$ such that $\sum_{i \in I} q_i = 1$, then altogether we obtain a new
probability assignment $q_i$ for every $x_i$ ($i \in I$), which has a corresponding Shannon-Fano code with code lengths
$l(c(x_i)) = \log 1/q_i$ for the $x_i$'s. Moreover, with respect to the probabilities induced by the original random
variable $X$, and simultaneously respecting the length restrictions, the minimum expected code word length of
such a $q$-based Shannon-Fano code is obtained by a partition of

$$Q = 1 - \sum_{i \in L} q_i \qquad (3)$$

into $q_i$'s ($i \in I - L$) such that $\sum_{i \in I} p_i \log 1/q_i$ is minimized. Clearly, the part $\sum_{i \in L} p_i \log 1/q_i$ cannot be
improved. Thus we need to minimize $S = \sum_{i \in I - L} p_i \log 1/q_i$ over all partitions of $Q = \sum_{i \in I - L} q_i$ into $q_i$'s.
The partition that reaches the minimum $S$ does not change by linear scaling of the $p_i$'s. Hence we can argue
as follows. Consider $S' = (1 - Q) \log 1/(1 - Q) + \sum_{i \in I - L} q_i \log 1/q_i$ such that $S'$ is the entropy of the set of
probabilities $\{1 - Q, q_i : i \in I - L\}$. Denote

$$P = \sum_{i \in I - L} p_i \qquad (4)$$

and define

$$q_i^0 = \frac{Q}{P} \, p_i \text{ for } i \in I - L. \qquad (5)$$

Then, both $S$ and $S'$ with $q_i = q_i^0$ ($i \in I - L$) reach their minimum for this partition of $Q$.

Lemma 1: Assume the above notation with $L, P, Q$ determined as above. The minimal expected prefix
code length under given length restrictions is achieved by encoding $x_i$ with code length $\log 1/((Q/P) p_i)$ for
all $i \in I - L$.
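One way to see why the proportional assignment is optimal is through Gibbs' inequality. The derivation below is a sketch and not necessarily the argument the authors intend. It writes $P = \sum_{i \in I - L} p_i$ for the total probability of the unrestricted words and $Q = \sum_{i \in I - L} q_i$ for the mass available to them, so that $p_i/P$ and $q_i/Q$ are both probability distributions on $I - L$.

```latex
\begin{align*}
S = \sum_{i \in I-L} p_i \log \frac{1}{q_i}
  &= P \sum_{i \in I-L} \frac{p_i}{P} \log \frac{1}{q_i/Q} \;+\; P \log \frac{1}{Q} \\
  &\ge P \sum_{i \in I-L} \frac{p_i}{P} \log \frac{1}{p_i/P} \;+\; P \log \frac{1}{Q},
\end{align*}
% Gibbs' inequality gives equality exactly when q_i / Q = p_i / P,
% that is, when q_i = (Q/P) p_i for every i in I - L.
```

The lower bound does not depend on how $Q$ is divided, so $S$ is minimized exactly by $q_i = (Q/P) p_i$, which corresponds to the real-valued code length $\log 1/((Q/P) p_i)$.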

Let us compare the optimal expected code length under length constraints with the unconstrained case. The
difference in code length is

$$\sum_{i \in I} p_i \log 1/q_i - \sum_{i \in I} p_i \log 1/p_i = \sum_{i \in I} p_i \log p_i/q_i,$$

the Kullback-Leibler divergence $D(p \,\|\, q)$ between the $p$-distribution and the $q$-distribution [6]. The
KL-divergence is always nonnegative, and is $0$ only if $p_i = q_i$ for all $i \in I$. For the optimum $q$-distribution
determined in Lemma 1 for the index set $I - L$ we can compute it explicitly:

$$\sum_{i \in I - L} p_i \log \frac{p_i}{q_i} = P \log \frac{P}{Q}.$$

Lemma 2: Given a random source $X$ with probabilities $P(X = x_i) = p_i$ ($i \in I$), length restrictions
$\{l_i : i \in I\}$, and with $L, P, Q$ determined as above. Then, the minimum expected constrained code length is

$$H(X, L) = \sum_{i \in I} p_i \log 1/p_i + \sum_{i \in L} p_i \log \frac{p_i}{q_i} + P \log \frac{P}{Q},$$

which equals the minimal expected unconstrained code word length $\sum_{i \in I} p_i \log 1/p_i$ only when $q_i = p_i$ for all
$i \in I$.

Thus, the redundancy induced by the equality length restrictions is

$$H(X, L) - H(X) = \sum_{i \in L} p_i \log \frac{p_i}{q_i} + P \log \frac{P}{Q}.$$

Note that, just as in the unconstrained case we can find a prefix code with code word lengths $\lceil \log 1/p_i \rceil$, showing
that the minimal expected integer prefix-code word length is in between the entropy $H(X)$ and $H(X) + 1$, the
same holds for the constrained case. There, we constructed a new set of probabilities with entropy $H(X, L)$,
and for this set of probabilities the minimal expected integer prefix-code word length is in between the entropy
$H(X, L)$ and $H(X, L) + 1$ by the usual argument.

Example 1: Let us look at an example with probabilities $X = (0.4, 0.2, 0.2, 0.1, 0.1)$ and length restrictions
$L = (\perp, 2, 2, 2, \perp)$. The entropy $H(X) \approx 2.12$ bits, and the (non-unique) Huffman code, without the length
restrictions, is 0, 10, 110, 1110, 1111, which shows the minimal integer code-word length of $C(X) = 2.2$ bits,
which is $\approx 0.08$ bits above the noninteger lower bound $H(X)$. The redundancy excess induced by the equality
length restrictions is $H(X, L) - H(X) \approx 0.24$ bits, which shows that the integer minimal average code-word
length $C(X, L)$ is in between $H(X, L) \approx 2.36$ bits and $H(X, L) + 1 \approx 3.36$ bits. The actual optimal equality
restricted code, given by Algorithm A below, is 111, 10, 01, 00, 110 with $C(X, L) = 2.5$, which is $\approx 0.14$ bits
above the noninteger lower bound $H(X, L)$.
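The figures in Example 1 can be reproduced with a few lines of code. This is a minimal sketch: the function names are hypothetical, `None` encodes the dummy restriction, and the constrained entropy is computed directly as $\sum_i p_i \log 1/q_i$ with the $q_i$ of Lemma 1.

```python
import math

def entropy(probs):
    """Entropy H(X) in bits."""
    return sum(p * math.log2(1 / p) for p in probs)

def constrained_entropy(probs, lengths):
    """Real-valued optimum H(X, L); assumes Q > 0 when free words exist."""
    fixed = [i for i, l in enumerate(lengths) if l is not None]
    free = [i for i, l in enumerate(lengths) if l is None]
    Q = 1 - sum(2.0 ** -lengths[i] for i in fixed)   # mass left for free words
    P = sum(probs[i] for i in free)                  # probability of free words
    fixed_part = sum(probs[i] * lengths[i] for i in fixed)
    free_part = sum(probs[i] * math.log2(P / (Q * probs[i])) for i in free)
    return fixed_part + free_part

def expected_length(probs, code):
    """Expected code-word length of an explicit list of code words."""
    return sum(p * len(w) for p, w in zip(probs, code))

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths = [None, 2, 2, 2, None]
h_x = entropy(probs)                        # about 2.12
h_xl = constrained_entropy(probs, lengths)  # about 2.36
c_x = expected_length(probs, ["0", "10", "110", "1110", "1111"])
c_xl = expected_length(probs, ["111", "10", "01", "00", "110"])
print(round(h_x, 2), round(h_xl, 2), round(c_x, 2), round(c_xl, 2))
```

Here $Q = 1/4$ and $P = 1/2$, so the two unrestricted words receive $q_1 = 0.2$ and $q_5 = 0.05$.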

III. Optimal Code under Equality Restrictions 

Above we have ignored the fact that real Shannon-Fano codes have code-word length $\lceil \log 1/p_i \rceil$ rather than
$\log 1/p_i$. This is the reason that $H(X) \le C(X) < H(X) + 1$ in the unconstrained case, leaving a slack of
1 bit for the minimal expected code word length. The Huffman code is an on-line method to obtain a code
achieving $C(X)$. Gallager [4] has proved an upper bound on the redundancy of a Huffman code, $C(X) - H(X)$,
of $p_n + \log[(2 \log e)/e]$ which is approximately $p_n + 0.086$, where $p_n$ is the probability of the least likely source
message. This is slightly improved in [5]. Our task below is to find a Huffman-like method to achieve the
minimal expected code-word length $C(X, L)$ in the length-constrained setting. Our goal is to come as close to
the optimum in Lemmas 1, 2 as is possible.

A. Free Stubs 

Input is the set of source words $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$ and length restrictions $l_1, \ldots, l_n$ that
should be satisfied by the target prefix code $c$ in the sense that $l(c(x_i)) = l_i$ for all $1 \le i \le n$ except for the
$i$'s with $l_i = \perp$, for which there are no code word length restrictions. Let $I = \{1, \ldots, n\}$, $L = \{i : l_i \ne \perp\}$,
and $P = I - L$. Denote

$$M = |P| = |\{p_i : i \in I - L\}|.$$

For convenience in notation, assume that the source words $x_1, \ldots, x_n$ are indexed such that $L = \{1, 2, \ldots, k\}$
with $l_1 \le l_2 \le \cdots \le l_k$. If $l_i = l_{i+1}$ for some $i$ we set $L := L - \{i, i + 1\}$; $L := L \cup \{(i, i + 1)\}$; $k := k - 1$;
$l_{(i,i+1)} := l_i - 1$. We repeat this process until there are no equal lengths left, and finish with $l'_1 < l'_2 < \cdots < l'_{k'}$,
with $l'_i$'s the resulting lengths. That is, we just iteratively merge two nodes which are at the same level in the tree.
Therefore, the problem is reduced to considering a code tree with forbidden code-word lengths $l'_i$ ($1 \le i \le k'$),
the largest forbidden length $l'_{k'}$ leading to a forbidden node with a free sibling node at the end of a path at the
same length from the root. That is, a code word tree with a single available free node at each level $h$ satisfying
$1 \le h \le l'_{k'}$ and $h \ne l'_i$ ($1 \le i \le k' - 1$). Each $h$ corresponds to a path of length $h$ leading from the root to a
node $n(h)$ corresponding to a code-word prefix that is as yet unused. We call such a node (and the path leading
to it or the corresponding code-word prefix) a free stub. Denote the set of levels of these free stubs $n(h)$ by $H$,
and let $m = |H|$. Without loss of generality,

$$H = \{h_k : 1 \le k \le m\}$$

with $h_1 < h_2 < \cdots < h_m$. We now have to find a code-word tree using only the free stubs, such that the
expected code-word length is minimized. We can do this in the straightforward manner, dividing the probabilities
in $P$ among the $m$ free stubs, and computing the minimal expected code-word length tree for the probabilities
for every stub for each of those divisions, and determining the division giving the least expected code-word
length. We can use Huffman's construction since it doesn't depend on the probabilities summing to 1. There
are $m^M$ possible divisions, so this process involves computing $m^M$ Huffman trees, which is exponentially many unless
$m = 1$, the unrestricted common Huffman case.
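The merging of equal lengths behaves like binary carry propagation, so the free stub levels can be read off as the positions of the 1 bits in the binary expansion of $1 - \sum_{i \in L} 2^{-l_i}$. The sketch below relies on that observation. The function name and the `None` encoding of the dummy restriction are assumptions for this example, and the case without any restriction is represented by a single stub at level 0 (the root).

```python
def free_stub_levels(lengths):
    """Return the sorted levels h_1 < ... < h_m of the free stubs.

    `lengths` holds a positive integer for each restricted word and `None`
    for each unrestricted word. Raises ValueError when (1) is violated.
    """
    fixed = [l for l in lengths if l is not None]
    depth = max(fixed, default=0)
    # Count code space in units of 2^-depth so that all arithmetic is exact.
    used = sum(1 << (depth - l) for l in fixed)
    free = (1 << depth) - used
    if free < 0:
        raise ValueError("length restrictions violate Kraft's inequality")
    # Bit (depth - h) of `free` is set exactly when a free stub sits at level h.
    return [h for h in range(depth + 1) if (free >> (depth - h)) & 1]

print(free_stub_levels([None, 2, 2, 2, None]))  # restrictions of Example 1
print(free_stub_levels([1, 2, None]))           # example from the Introduction
```

For the restrictions of Example 1 the only free stub is at level 2, matching the prefix 11 used by the two unrestricted code words there.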

B. Reduction 

Let level($p$) denote the number of edges in a path from the leaf node labeled by $p$ to the root. So level(root) $= 0$.
Then, if a tree is optimal (has least expected code-word length), then

$$\text{if level}(p) < \text{level}(q) \text{ then } p \ge q, \qquad (6)$$

since otherwise the expected code-word length can be decreased by interchanging $p$ and $q$. If $T$ is a prefix-code
word tree for the source words under the given length restrictions, then the subtree $T_i$ is the subtree of $T$ with
the free stub $n(h_i)$ as its root ($1 \le i \le m$).

Lemma 3: There is a tree $T$ with minimal expected code-word length such that if $i < j$ then $p \ge q$ for all
$p$ in $T_i$ and $q$ in $T_j$.

Proof: Suppose the contrary: for every optimal tree $T$, there are $p < q$ with $p$ in $T_i$ and $q$ in $T_j$ for some
$i < j$. Fix any such $T$. By (6), level($p$) $\ge$ level($q$). Let $T_{p,q}$ be the subtree with root at level($q$) containing the
leaf $p$. Then we can interchange $q$ and $T_{p,q}$ without changing the expected code-word length represented by the
tree. This idea leads to the following sorting procedure: Repeat until impossible: find a least level probability, and
if there is more than one of them a largest one, that violates the condition in the lemma, and interchange it with
a subtree as above. Since no probability changes level, the expected code-word length stays invariant. In each
operation a least level violating probability moves to a lower indexed subtree, and the subtree it is interchanged
with does not introduce new violating probabilities at that level. The transformation is iteratively made from
the top to the bottom of the tree. This process must terminate with an overall tree satisfying the lemma, since
there are only a given number of probabilities and indexed subtrees. ■
For ease of notation we now assume that the source words $x_1, \ldots, x_n$ are indexed such that the unrestricted
source words are indexed $x_1, \ldots, x_M$ with probabilities $p_1 \ge p_2 \ge \cdots \ge p_M$. The fact that we just have to
look for a partition of the ordered list of probabilities into $m$ segments, rather than considering every choice
of $m$ subsets of the set of probabilities, considerably reduces the running time to find an optimal prefix code.
Consider the ordered list $p_1 \ge \cdots \ge p_M$. Partition it into $m$ contiguous segments, possibly empty, which gives
$\binom{M+m-1}{m-1}$ partitions. We can reduce this number by noting that in an optimal tree the free stubs are at different
heights, and therefore each of them must have a tree of at least two elements until they have empty trees from
some level down. Otherwise the tree is not optimal since it can be improved by rearranging the probabilities.
Therefore, we can restrict attention to partitions into $\le m$ segments, which contain at least two elements. There
are at most $\binom{M-m}{m}$ such partitions. For each choice, for each set of probabilities corresponding to an $i$th segment
construct the Huffman tree and attach it to the $i$th free stub, and compute the expected code-word length for
that choice. A tree associated with the least expected code-word length is an optimal tree. Thus, we have to
construct at most $\binom{M-m}{m}$ Huffman trees, which is polynomial in $M$ for fixed $m$, and also polynomial in $M$ for
either $m$ or $M - m$ bounded by a constant.
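The contiguous partitions can be enumerated by choosing $m - 1$ non-decreasing cut points among the $M + 1$ gaps of the ordered list. The sketch below does this and confirms the count $\binom{M+m-1}{m-1}$ for a small case. The function name and the half-open index pairs are choices made for this example, and the enumeration covers all partitions with possibly empty segments, without the further pruning described above.

```python
from itertools import combinations_with_replacement
from math import comb

def contiguous_partitions(M, m):
    """Yield every split of indices 0..M-1 into m contiguous segments.

    Segments may be empty; each partition is a list of half-open
    (start, end) index pairs in order of increasing stub index.
    """
    for cuts in combinations_with_replacement(range(M + 1), m - 1):
        bounds = (0, *cuts, M)
        yield [(bounds[k], bounds[k + 1]) for k in range(m)]

M, m = 4, 3
parts = list(contiguous_partitions(M, m))
print(len(parts), comb(M + m - 1, m - 1))
```

Each partition assigns the segment with index $k$ to the $k$th free stub, so the largest probabilities go to the shallowest stubs, as Lemma 3 allows.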

C. Polynomial Solution 

From each partition of $p_1 \ge \cdots \ge p_M$ into $m$ segments (possibly empty) consisting of probabilities $P_k$ for
the $k$th segment, we can construct trees $T_1, T_2, \ldots, T_m$ with $T_k$ having the probabilities of $P_k$ as the leaves,
and free stub $n(h_k)$ as the root. Clearly, if $T$ is an overall tree with minimum expected code-word length, then
each subtree with the free stub $n(h_k)$ as root, considered in isolation, achieves minimal expected code-word
length over the probabilities involved. We want to find the optimal partition with a minimum amount of work.
Note that, from some $s \le m$ on, every subtree $T_k$ with $s \le k \le m$ may be empty.

For every tree $T$, not necessarily optimal, let $L_T$ denote the expected code-word length for the probabilities
in $P$ according to tree $T$. Define $H[i, j, k]$ to be the minimal expected code-word length of the leaves of a tree
$T[i, j, k]$ constructed from probabilities $p_i, \ldots, p_j$ ($i \le j$) and with a singlefold path from the root of $T$ to the
free stub node $n(h_k)$, and subsequently branching out to encode the source words (probabilities) concerned.
Then,

$$H[i, j, k] = \sum_{i \le r \le j} p_r (h_k + l(p_r)), \qquad (7)$$

each probability $p_r$ labeling a leaf at the end of a path of length $h_k + l(p_r)$ from the root of $T$, the first part
of length $h_k$ to the free stub $n(h_k)$, and the second part of length $l(p_r)$ from $n(h_k)$ to the leaf concerned. For
a partition of the probability index sequence $1, \ldots, M$ into $m$ (possibly empty) contiguous segments $[i_k, j_k]$
($1 \le k \le m$), inducing subtrees $T[i_k, j_k, k]$ using free stubs $h_k$ accounting for expected code-word length
$H[i_k, j_k, k]$, we obtain a total expected code-word length for the overall tree $T$ of

$$L_T = \sum_{1 \le k \le m} H[i_k, j_k, k].$$
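As a worked instance of (7), consider the data of Example 1, where the only free stub lies at level $h_1 = 2$ and the unrestricted probabilities, reindexed as $p_1 = 0.4$ and $p_2 = 0.1$, each hang one edge below that stub. The arithmetic below is an illustrative check under those assumptions. The last line adds the contribution of the three restricted words of length 2, which is not part of $L_T$.

```latex
\begin{align*}
H[1,2,1] &= p_1\,(h_1 + l(p_1)) + p_2\,(h_1 + l(p_2)) \\
         &= 0.4\,(2 + 1) + 0.1\,(2 + 1) = 1.5, \\
L_T      &= H[1,2,1] = 1.5, \\
C(X,L)   &= L_T + (0.2 + 0.2 + 0.1)\cdot 2 = 1.5 + 1.0 = 2.5 .
\end{align*}
```

The result agrees with the value $C(X, L) = 2.5$ reported in Example 1.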

Let us now consider the expected code word length of a tree $T'$ which consists of tree $T$ with a subset of
subtrees $T_k$ removed, together with the corresponding probabilities from the overall probability set $P$. Removing subtree
$T_k$ is equivalent to removing the corresponding free stub $n(h_k)$, and turning it into a length restriction.

Lemma 4: Let $T$, $P$ and $T'$ be defined as above. If $T$ has minimal total code-word length for $P$ then
the total code-word length of every $T'$ as above cannot be improved by another partition of the probabilities
involved among its subtrees.

Proof: (If) If we could improve the total code word length of $T$ by a redistribution of probabilities among
the subtrees attached to the free stubs then some $T'$ would not have minimal total code-word length before this
redistribution.
(Only if) If we could improve the total code-word length of any $T'$ by redistribution of the probabilities
among its subtrees attached to the free stubs involved, then we could also do this in the overall tree $T$ and
improve its overall total code-word length, contradicting minimality. ■

Corollary 1: If tree $T$ has minimal total code-word length, then every tree $T'$ obtained from it as above
has minimal total code word length.

This suggests a way to construct an optimal $T$ by examining every $k$-partition corresponding to a candidate
set of $k$ subtrees (for $k := 1, \ldots, m$), of every initial segment $p_1 \ge \cdots \ge p_j$ of the probability sequence
(for $j := 1, \ldots, M$). The minimal expected code-word length tree for the $k$th partition element is attached
to the $k$th free stub. The crucial observation is that by Corollary 1 the minimal total code word length for
probabilities $P_1 \cup \cdots \cup P_{k+1}$ using free stub levels $h_1, \ldots, h_{k+1}$ is reached for a binary split in the ordered
probabilities and free stubs involved, consisting of the minimal total code-word length solution for probabilities
$P_1 \cup \cdots \cup P_k$ using stub levels $h_1, \ldots, h_k$ and probabilities $P_{k+1}$ using free stub level $h_{k+1}$. Computing the
optimal minimum code-word lengths of initial probability segments and initial free stub level segments in
increasing order, this way we find each successive optimum by using previously computed optima. This type
of computation of a global optimum is called dynamic programming. The following Algorithm A gives the
precise computation. At termination of the algorithm, the array $F[j, k]$ will contain the minimal expected code
word length of a tree $T[1, j, k]$ using the largest $j$ ($j \le M$) probabilities $p_1, \ldots, p_j$, optimally divided into
subtrees attached to the least level $k$ ($k \le m$) free stubs. Thus, $T[1, j, k]$ contains subtrees $T[j_i, j_{i+1}, i + 1]$
($0 \le i < k$, $j_0 = 1 \le j_1 \le \cdots \le j_k$), each such subtree with $n(h_{i+1})$ as the root ($0 \le i < k$). Thus,
on termination $F[M, m]$ contains the minimal expected code-word length of the desired optimal tree $T$ with
subtrees $T^0_1, \ldots, T^0_m$, with $T^0_i$ rooted at the root of $T$ and using free stubs $n(h_1), \ldots, n(h_m)$, respectively. (Note,
$T^0_k$ consists of the previously defined subtree $T_k$ rooted at free stub $n(h_k)$, plus the path from $n(h_k)$ to the
root of $T$.) We can reconstruct these subtrees, and hence the desired code words, from the values of the array
$s$ on termination. Denote $s_{m+1} = M$, $s_m = s(M, m)$, $s_k = s(s_{k+1}, k)$ ($m > k \ge 2$), and $s_1 = 0$. Then,
$T^0_k = T[s_k + 1, s_{k+1}, k]$ and $F[s_{k+1}, k]$ equals the expected code-word length of the source words encoded
in subtrees $T^0_1, \ldots, T^0_k$ ($1 \le k \le m$). Thus, the array values of $s$ give the desired partition of the ordered
probability sequence $p_1 \ge \cdots \ge p_M$, and we can trivially construct the tree $T$ and the code-words achieving
minimal expected code-word length by Huffman's construction on the subtrees.

Algorithm A 

Input: Given $n$ source words with ordered probabilities and equality length restrictions, first check whether
(1) is satisfied with $\perp = \infty$, otherwise return "impossible" and quit. Compute free stub levels $h_1 < \cdots < h_m$
and probabilities $p_1 \ge \cdots \ge p_M$ as above.

Step 1: Compute $H[i, j, k]$ as in (7), for all $i$, $j$ and $k$ ($1 \le i \le j \le M$, $1 \le k \le m$).
Step 2: Set $F[j, 1] := H[1, j, 1]$ ($1 \le j \le M$).
Step 3: for $k := 2, \ldots, m$ do
            for $j := 1, \ldots, M$ do
                $F[j, k] := \min_{1 \le i \le j} \{F[i, k - 1] + H[i + 1, j, k]\}$;
                $s[j, k] := i_0$, with $i_0$ the least $i$ achieving the minimum

end of Algorithm
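The following is a minimal runnable sketch of the dynamic program, not a definitive implementation. It uses 0-based half-open segments, also allows leading segments to be empty, and tabulates the Huffman cost of every contiguous segment with a binary heap, which adds a logarithmic factor compared with the step count given in the text. The names `huffman_cost` and `algorithm_a` are hypothetical, and the returned cost covers only the unrestricted words.

```python
import heapq

def huffman_cost(weights):
    """Sum of weight * depth in a Huffman tree over `weights`.

    A single weight sits at the root (depth 0); an empty list costs 0.
    """
    heap = list(weights)
    heapq.heapify(heap)
    cost = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        cost += merged  # every leaf below this merge gains one edge
        heapq.heappush(heap, merged)
    return cost

def algorithm_a(probs, stubs):
    """probs: unrestricted probabilities, sorted in non-increasing order.
    stubs: free stub levels h_1 < ... < h_m (at least one stub assumed).
    Returns (minimal cost, list of half-open segments, one per stub)."""
    if not stubs:
        raise ValueError("at least one free stub is required")
    M, m = len(probs), len(stubs)
    prefix = [0.0]
    for p in probs:
        prefix.append(prefix[-1] + p)
    # Huffman cost of every contiguous segment probs[i:j].
    seg = {(i, j): huffman_cost(probs[i:j])
           for i in range(M + 1) for j in range(i, M + 1)}

    def H(i, j, k):
        # Equation (7): cost inside the subtree plus the path to stub k.
        return seg[i, j] + stubs[k] * (prefix[j] - prefix[i])

    F = [[float("inf")] * m for _ in range(M + 1)]
    s = [[0] * m for _ in range(M + 1)]
    for j in range(M + 1):
        F[j][0] = H(0, j, 0)
    for k in range(1, m):
        for j in range(M + 1):
            for i in range(j + 1):  # strict '<' keeps the least minimizing i
                cand = F[i][k - 1] + H(i, j, k)
                if cand < F[j][k]:
                    F[j][k], s[j][k] = cand, i
    # Walk the split points back from the last stub to the first.
    segments, j = [], M
    for k in range(m - 1, -1, -1):
        i = s[j][k] if k > 0 else 0
        segments.append((i, j))
        j = i
    segments.reverse()
    return F[M][m - 1], segments

cost, segments = algorithm_a([0.4, 0.1], [2])  # unrestricted part of Example 1
print(cost, segments)
```

For Example 1 the unrestricted words contribute 1.5 bits, and adding the 1.0 bit contributed by the three restricted words of length 2 gives the total of 2.5 stated there. The code words themselves follow by running Huffman's construction on each returned segment below its stub.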

Theorem 1: Given $n$ source words with probabilities and equality length restrictions, Algorithm A constructs
a prefix code with optimal expected code word length in $O(n^3)$ steps.

Proof: The correctness of the algorithm follows from Corollary 1 and the discussion following it.

The complexity of computing the $h_1 < \cdots < h_m$ and $p_1 \ge \cdots \ge p_M$ is $O(n \log n)$. Step 1 of the
algorithm takes $O(M^3) + O(mM^2)$ steps. First, compute for every $i, j$ the quantities $P[i, j] := \sum_{r=i}^{j} p_r$ and
$L[i, j] := \sum_{r=i}^{j} p_r \, l(p_r)$. There are $O(M^2)$ such quantities and each computation takes $O(M)$ steps. Second,
for every $k$ compute $H[i, j, k] = L[i, j] + h_k P[i, j]$. For each pair $i, j$ there are $m$ such quantities and each computation takes
$O(1)$ steps. Step 2 of the Algorithm takes $O(M)$ steps. Step 3 of the Algorithm involves an outer loop of length
$m - 1$, an inner loop of length $M$, and inside the nesting the determining of the minimum of $\le M$ possibilities;
overall $O(mM^2)$ steps. The running time of the algorithm is therefore $O(n \log n) + O(M^3) + O(mM^2)$ steps.
Since $M, m \le n$ this shows the stated running time. ■

References 

[1] Y.S. Abu-Mostafa and R.J. McEliece, Maximal codeword lengths in Huffman codes, Computers and Mathematics with Applications,
39(2000), 129-134.

[2] M.B. Baer, Twenty (or so) questions: bounded-length Huffman coding, arXiv:cs.IT/0602085, 2006.

[3] D. Baron, A.C. Singer, On the cost of worst-case coding length constraints, IEEE Trans. Inform. Theory, 47:11(2001), 3088-3090.

[4] R.G. Gallager, Variations on a theme by Huffman, IEEE Trans. Inform. Theory, 24:6(1978), 668-674.

[5] R.M. Capocelli, A. De Santis, On the redundancy of optimal codes of limited word length, IEEE Trans. Inform. Theory, 38:2(1992),
439-445.

[6] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley & Sons, 1991.

[7] M. Karpinski, Y. Nekrich, Algorithms for construction of optimal or almost-optimal length-restricted codes, Proc. Data Compression
Conf. (DCC'05), 2005.

[8] L.G. Kraft, A device for quantizing, grouping and coding amplitude modulated pulses, Master's thesis, Dept. of Electrical Engineering,
M.I.T., Cambridge, Mass., 1949.

[9] L. Larmore, D. Hirschberg, A fast algorithm for optimal length-limited Huffman codes, J. Assoc. Comput. Mach., 37:3(1990), 464-473.

[10] D.A. Huffman, A method for construction of minimum-redundancy codes, Proceedings IRE, 40:9(1952), 1098-1101.

[11] G.O.H. Katona and T.O.H. Nemetz, Huffman codes and self-information, IEEE Trans. Inform. Theory, 22:3(1976), 337-340.

[12] A. Moffat, A. Turpin, J. Katajainen, Space-efficient construction of optimal prefix codes, Proc. Data Compression Conf., Snowbird,
UT, March 1995, 192-201.

[13] C.E. Shannon, The mathematical theory of communication, Bell System Tech. J., 27(1948), 379-423, 623-656.



DRAFT 



